{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# <center> RegEx in Python</center>\n",
    "\n",
    "![](images/memes/meme3.jpg)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Getting started with RegEx in Python\n",
    "\n",
    "The **[re](https://docs.python.org/3/howto/regex.html)** module provides an interface to the regular expression engine, allowing you to **compile regular expressions into objects and then perform matches with them**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Compiling Regular Expressions\n",
    "\n",
    "Regular expressions are **compiled** into `Pattern` objects, which have methods for various operations such as searching for pattern matches or performing string substitutions.\n",
    "\n",
    "\n",
    "### `re.compile(pattern, flags=0)`\n",
    "\n",
    "Compile a regular expression pattern, returning a pattern object.\n",
    "\n",
    "- The regular expression is passed to `re.compile()` as a **string**. \n",
    "\n",
    "> Regular expressions are handled as strings because regular expressions aren’t part of the core Python language, and no special syntax was created for expressing them. \n",
    "\n",
    "> Regular expression patterns are compiled into a series of bytecodes which are then executed by a matching engine written in C."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "pattern = re.compile(\"hello\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "re.compile(r'hello', re.UNICODE)"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pattern"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- `re.compile()` also accepts an optional `flags` argument, used to enable various special features and syntax variations. [More about flags](http://xahlee.info/python/python_regex_flags.html)\n",
    "\n",
    "<br>\n",
    "\n",
    "In the example below, we use the flag `re.I` (short for `re.IGNORECASE`) to ignore letter case in the regex pattern."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "pattern = re.compile(\"hello\", flags=re.I)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "re.compile(r'hello', re.IGNORECASE|re.UNICODE)"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pattern"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Performing Matches\n",
    "\n",
    "So, we have created a `Pattern` object representing a compiled regular expression using `re.compile()` method.\n",
    "\n",
    "Pattern objects have several methods and attributes.\n",
    "\n",
    "Here is the list of different methods used for performing matches:\n",
    "\n",
    "\n",
    "<table style=\"border: 1px solid black; font-size:15px;\">\n",
    "<thead>\n",
    "    <th>Method/Attribute</th>\n",
    "    <th>Purpose</th>\n",
    "</thead>\n",
    "    \n",
    "<tbody>\n",
    "<tr>\n",
    "    <td>match()</td>\n",
    "    <td>Determine if the RE matches at the beginning of the string.</td>\n",
    "</tr>\n",
    "    \n",
    "<tr>\n",
    "    <td>search()</td>\n",
    "    <td>Scan through a string, looking for any location where this RE matches.</td>\n",
    "</tr>\n",
    "\n",
    "<tr>\n",
    "    <td>findall()</td>\n",
    "    <td>Find all substrings where the RE matches, and returns them as a list.</td>\n",
    "</tr>\n",
    "\n",
    "<tr>\n",
    "    <td>finditer()</td>\n",
    "    <td>Find all substrings where the RE matches, and returns them as an iterator.</td>\n",
    "</tr>\n",
    "</tbody>\n",
    "</table>\n",
    "\n",
    "Let us go through them one by one:\n",
    "\n",
    "### `match(string[, pos[, endpos]])`\n",
    "\n",
    "- A match is checked only at the beginning (by default).\n",
    "\n",
    "- Checking starts from `pos` index of the string. (default is 0)\n",
    "\n",
    "- Checking is done until `endpos` index of string. `endpos` is set as a very large integer (by default).\n",
    "\n",
    "- Returns `None` if no match found.\n",
    "\n",
    "- If a match is found, a `Match` object is returned, containing information about the match: where it starts and ends, the substring it matched, and more."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "pattern = re.compile(\"hello\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "match = pattern.match(\"hello world\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(0, 5)"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "match.span()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "match.start()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "5"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "match.end()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "False"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pattern.match(\"say hello\", pos=4) is None"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pattern.match(\"hello\", endpos=4) is None"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### `search(string[, pos[, endpos]])`\n",
    "\n",
    "- A match is checked throughtout the string.\n",
    "\n",
    "- Same behaviour of `pos` and `endpos` as the `match()` function.\n",
    "\n",
    "- Returns `None` if no match found.\n",
    "\n",
    "- If a match is found, a `Match` object is returned."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<_sre.SRE_Match object; span=(4, 9), match='hello'>"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pattern.search(\"say hello\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<_sre.SRE_Match object; span=(4, 9), match='hello'>"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pattern.search(\"say hello hello\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### `findall(string[, pos[, endpos]])`\n",
    "\n",
    "- Finds **all non-overlapping substrings** where the match is found, and returns them as a list.\n",
    "\n",
    "- Same behaviour of `pos` and `endpos` as the `match()` and `search()` function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['hello', 'hello']"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pattern.findall(\"say hello hello\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### `finditer(string[, pos[, endpos]])`\n",
    "\n",
    "- Finds **all non-overlapping substrings** where the match is found, and returns them as an iterator of the `Match` objects.\n",
    "\n",
    "- Same behaviour of `pos` and `endpos` as the `match()`, `search()` and `findall()` function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "matches = pattern.finditer(\"say hello hello\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(4, 9)\n",
      "(10, 15)\n"
     ]
    }
   ],
   "source": [
    "for match in matches:\n",
    "    print(match.span())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "say \u001b[43m\u001b[1mhello\u001b[0m \u001b[43m\u001b[1mhello\u001b[0m\n"
     ]
    }
   ],
   "source": [
    "from utils import highlight_regex_matches\n",
    "highlight_regex_matches(pattern, \"say hello hello\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> By now, you must have noticed that `match()`, `search()` and `finditer()` return `Match` object(s) where as `findall()` returns a list of strings.\n",
    "\n",
    "\n",
    "### Note:\n",
    "\n",
    "It is not mandatory to create a `Pattern` object explicitly using `re.compile()` method in order to perform a regex operation.\n",
    "\n",
    "You can direclty use the module level functions such as:\n",
    "- `re.match(pattern, string, flags=0)`\n",
    "\n",
    "- `re.search(pattern, string, flags=0)`\n",
    "\n",
    "- `re.findall(pattern, string, flags=0)`\n",
    "\n",
    "- `re.finditer(pattern, string, flags=0)`\n",
    "\n",
    "and so on.\n",
    "\n",
    "In a module level function, you can simply pass a **string** as your **regex pattern** as shown in the examples below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<_sre.SRE_Match object; span=(0, 5), match='hello'>"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "re.match(\"hello\", \"hello\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['hello', 'hello']"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "re.findall(\"hello\", \"say hello hello\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Important Example\n",
    "\n",
    "Consider the example below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "txt = \"This book costs $15.\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Search for the pattern `$15`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [],
   "source": [
    "pattern = re.compile(\"$15\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [],
   "source": [
    "pattern.search(txt)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### No match found. Why?\n",
    "\n",
    "`$` is a metacharacter and has a special meaning for regex engine. Here, we want to treat it like a literal.\n",
    "\n",
    "In order to treat a metacharacter like a literal, you need to **escape** it using `\\` character."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [],
   "source": [
    "pattern = re.compile(\"\\$15\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<_sre.SRE_Match object; span=(16, 19), match='$15'>"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pattern.search(txt)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In regular expressions, there are twelve metacharacters that should be escaped if they are to be used with their literal meaning:\n",
    "\n",
    "- Backslash `\\`\n",
    "- Caret `^`\n",
    "- Dollar sign `$`\n",
    "- Dot `.`\n",
    "- Pipe symbol `|`\n",
    "- Question mark `?`\n",
    "- Asterisk `*`\n",
    "- Plus sign `+`\n",
    "- Opening parenthesis `(`\n",
    "- Closing parenthesis `)`\n",
    "- Opening square bracket `[`\n",
    "- The opening curly brace `{`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![](images/memes/meme4.jpg)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
